Welcome everybody to today's deep learning lecture. Today we want to talk a bit about common practices, the things you need to know to get everything implemented in practice. So here is a small outline of the next couple of videos and the topics we will look at. We will think about the problems that we currently have and how far we have come. Then we talk about training strategies, in particular optimization and the learning rate, and a couple of tricks for how to adjust them. Next come architecture selection and hyperparameter optimization. One trick that is really useful is ensembling, and typically people also have to deal with class imbalance, for which there are some very interesting approaches. Finally, we look into evaluation and how to get a good estimate of how well our network is actually performing. So far we have seen all the nuts and bolts of how to train a network.
We have the fully connected and convolutional layers, the activation functions, the loss functions, optimization, and regularization, and today we will talk about how to choose the architecture, and how to train and evaluate a deep neural network. The very first thing is test data. Test data goes into the vault: ideally, the test set should be kept in a vault and only be brought out at the end of the data analysis, as Hastie and colleagues teach in The Elements of Statistical Learning. First things first, overfitting is extremely easy with neural networks; remember the ImageNet experiments with random labels. The true test set error and the generalization can be underestimated substantially when you use the test set for model selection. Not a good idea. Choosing the architecture is typically the first element of model selection, and this should never be done on the test set. You can do initial experimentation on a smaller subset of the data to figure out what works, but never work on the test set when you are selecting the architecture.
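As a minimal sketch of this discipline, assuming scikit-learn's train_test_split and placeholder arrays X and y: split the test set off once, put it in the vault, and do all further experimentation on the remaining development data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data; X and y stand in for your real dataset.
X = np.random.randn(1000, 32)
y = np.random.randint(0, 10, size=1000)

# Split the test set off once and "put it in the vault":
# it is only touched for the final evaluation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# All model selection (architecture, hyperparameters) happens on a
# further train/validation split of the development data only.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)
```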
Okay, let's look at a couple of training strategies. Before the training, check your gradients, check the loss function, and check that your own layer implementations compute correctly. If you implemented your own layer, compare the analytic and the numerical gradient: use central differences for the numerical gradient, use relative errors instead of absolute differences, and consider the numerics. Use double precision for checking, temporarily scale the loss function if you observe very small values, and choose the step size h appropriately.
Then we have a couple of additional recommendations. If you only use a few data points, you will have fewer issues with non-differentiable parts of the loss function. You can train the network for a short period of time and only then perform the gradient checks. Check the gradient first without the regularization terms, then with them: so first turn the regularization terms off, check the gradient, and then check again with the regularization terms switched on. Also turn off data augmentation and dropout during the checks. Typically, you make these checks on rather small data sets.
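In a framework like PyTorch, the same protocol can be sketched with the built-in torch.autograd.gradcheck; the linear layer below is just a stand-in for your own implementation.

```python
import torch

# Placeholder layer standing in for your own implementation. Use double
# precision, and eval() mode so that dropout-style layers are disabled.
layer = torch.nn.Linear(4, 3).double()
layer.eval()

x = torch.randn(2, 4, dtype=torch.double, requires_grad=True)

# torch.autograd.gradcheck compares the analytic gradient from backward()
# against central differences. Run it once on the plain data loss; then
# repeat with your regularization term added to the loss, and keep data
# augmentation switched off while checking.
ok = torch.autograd.gradcheck(lambda inp: layer(inp).sum(), (x,), eps=1e-6)
print("gradient check passed:", ok)
```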
Next, the goal of the initialization check is to verify that you have a correct random initialization of the layers. You can compute the loss for each class on the untrained network with regularization turned off, and of course an untrained network should give a random classification. You can then compare this loss with the loss achieved when deciding for a class uniformly at random, and they should be about the same, because you randomly initialized. Then you repeat this with multiple random initializations, just to check that there is nothing wrong with the initialization.
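For a C-class problem with softmax cross-entropy, deciding for a class uniformly at random corresponds to probabilities 1/C, so the expected initial loss is about -ln(1/C) = ln(C); for C = 10 that is ln 10 ≈ 2.303. A quick sketch of this sanity check, with a placeholder model and batch:

```python
import math
import torch
import torch.nn as nn

num_classes = 10
# Placeholder untrained classifier with a standard random initialization.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, num_classes))

x = torch.randn(256, 32)                   # placeholder batch
y = torch.randint(0, num_classes, (256,))  # arbitrary labels

with torch.no_grad():
    loss = nn.functional.cross_entropy(model(x), y)

expected = math.log(num_classes)           # -ln(1/C) = ln(C) ≈ 2.303 for C = 10
print(f"initial loss {loss.item():.3f} vs expected {expected:.3f}")
# Repeat with several random initializations; all values should stay
# close to ln(C) if nothing is wrong (no regularization term included here).
```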
Now let's go to the training. First, you check whether the architecture is in general capable of learning the task. Before training the network on the full data set, take a small subset of the data, maybe 5 to 20 samples, and try to overfit the network to get a zero loss. With so few samples, you should be able to memorize the entire subset. If you can really go down to zero loss, you know that your training procedure actually works. Optionally, you can turn off the regularization, because it may hinder this overfitting procedure.
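A sketch of this overfitting test, assuming a placeholder model and a tiny batch of 10 samples; Adam is used with its default weight_decay of zero, so no regularization interferes with the memorization.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice, use the architecture you want to test.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

x = torch.randn(10, 32)              # tiny subset: roughly 5 to 20 samples
y = torch.randint(0, 10, (10,))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # weight_decay defaults to 0
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# With so few samples the network should memorize the batch and the loss
# should go to (almost) zero; if it does not, suspect a bug or too little capacity.
print(f"final loss: {loss.item():.6f}")
```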
If the network can't overfit, you may have a bug in the implementation, your model may be too small, so you may want to increase the number of parameters or the model capacity, or the model may simply not be suitable for this task. Also, get a first idea about how the data...